J. Comb. Chem. 1999, 1, 32-45

Beyond Mere Diversity: Tailoring Combinatorial Libraries for Drug Discovery

Eric J. Martin* and Roger E. Critchlow*

r mace Avenue, Santa Fe, New Mexico 87501 
Received June 30, 1998

[...] good library design for drug discovery. A library can be better
"tailored" by [...] as polar, pharmacophoric, rigid, low molecular
[weight ...] steps of D-optimal design generates diverse designs [...]
with desirable proportions of these properties. Comparing the diversity
scores [...] reveals the tradeoffs between diversity, physical property
distributions, [...]. [The] diversity score can be calibrated by scoring
the best designs from [...] classes of substituents or by randomly
eliminating [...]. [...] shows how [poor] random designs are compared
even to highly biased optimal designs. [...] effort between computational
and synthetic medicinal chemists [...]. Interactive software has been
developed to integrate substructure searching, display, and [...] design
to facilitate this [...] for the effective design of well-tailored
libraries.

Introduction 

[...] best set of substituents for a combinatorial synthetic scheme to
maximize the chances of finding a [...] lead. Initial efforts [in
combina]torial library design focused primarily on maximizing
information content and minimizing redundancy by maximizing "diversity",
allowing some bias by [...] of pharmacophoric side [chains. ... were]
designed, synthesized, and screened, and potent ligands were identified.
Inspection [...] revealed that many were more insoluble [...] or of
higher molecular weight than would

be preferred in a drug lead. This outcome underscored the many factors
beyond diversity which [...] combinatorial library design for drug
discovery. Molecular weight, [...] of synthesis, pharmacophore [...],
rigidity, reagent costs, solubility, incorporation of common [...], and
medicinal-chemical intuition should all be taken into account. Merely
maximizing diversity has been shown to systematically bias the library
away from the desirable [...] for many of these properties. The goal of
library design should be to provide high structural diversity while
[...].

[We] have developed [...] diversity while emphasizing desirable
attributes and identifying the tradeoffs which are inherent in the
chemistry. One can quickly see how much diversity is sacrificed by using
fewer groups that require protection or how many and which
pharmacophoric fragments might best be included in a target[...].

[...] Figure 1. Suitable reagents are identified from a database of
commercially available compounds. Structural properties are calculated
for each candidate substituent. From the [...] similarity. The
substituents are also divided into "bins" based on additional
properties, besides diversity, which are important in small-molecule
drugs. Finally, a small set of substituents is selected from the
candidates that maximizes diversity while at the same time satisfying a
specified profile of these additional properties. The following example
will illustrate in detail all of the steps in generating a typical
[tailo]red design and will present an [...]

Background Theory 

Similarity, Property Space, and Multidimensional Scaling. [...] for a
combinatorial library entails two steps: the calculation of a "property
space" in which proximity between substituents reflects structural
similarity and the subsequent selection of points that are well
distributed throughout that space. If the [...], substituents clustered
in property space are potentially redundant, whereas a widely and evenly
dispersed set is "diverse".

A vector of properties for each substituent can be regarded as
coordinates in a [...] space. Often, however, similarities (i.e.,
distances) between compounds can be directly calculated, but the
coordinates cannot. Given a table of properties (coordinates), the
Pythagorean theorem can easily [be used to com]pute a matrix of all
pairwise dissimilarities (distances). The reverse operation, calculating
"latent" properties for each substituent from a dissimilarity matrix, is
known as multidimensional scaling (MDS). It is computationally
expensive, and if the similarity coefficient is not a metric (and the
Tanimoto similarity coefficient used widely in drug discovery problems
is not a metric), it can only be achieved approximately. It is currently
practical for sets of up to about 10 000 substituents, adequate for most
substituent selection problems.
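
The two directions described above (coordinates to distances, and distances back to latent coordinates) can be sketched as follows. This is a minimal illustration using classical (Torgerson) scaling with NumPy; the text does not say which MDS algorithm was used, so the method and the function names `pairwise_distances` and `classical_mds` are assumptions made for illustration only.

```python
import numpy as np


def pairwise_distances(coords: np.ndarray) -> np.ndarray:
    """Euclidean distance matrix from an (n, d) table of coordinates."""
    diff = coords[:, None, :] - coords[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


def classical_mds(dist: np.ndarray, n_dims: int) -> np.ndarray:
    """Latent coordinates from a distance matrix by classical scaling."""
    n = dist.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    # Double-centering the squared distances gives a Gram (inner product) matrix.
    gram = -0.5 * centering @ (dist ** 2) @ centering
    eigvals, eigvecs = np.linalg.eigh(gram)
    order = np.argsort(eigvals)[::-1][:n_dims]  # largest eigenvalues first
    # Negative eigenvalues indicate non-Euclidean dissimilarities; clip them,
    # which is why the embedding is then only approximate.
    scale = np.sqrt(np.clip(eigvals[order], 0.0, None))
    return eigvecs[:, order] * scale


rng = np.random.default_rng(0)
points = rng.normal(size=(6, 2))
recovered = classical_mds(pairwise_distances(points), n_dims=2)
# Coordinates come back only up to rotation, reflection, and translation,
# so the comparison is made on the distances they imply.
assert np.allclose(pairwise_distances(points), pairwise_distances(recovered))
```

For truly Euclidean input the distances are reproduced exactly; for dissimilarities that cannot be embedded exactly, the clipped eigenvalues measure how much is lost.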
An algorithmic definition of "diversity" should incorporate two key
concepts: "redundancy" and "coverage". A nonredundant set of points is
widely separated in space. A set of points covers space if all regions
of space, [and] in particular [all dimen]sions, [...]. While there are
methods for measuring the "diversity" of a set of points that only
require a matrix of the distances between pairs of candidates,
coordinates are required for the powerful variance-based methods, such
as D-optimal or A-optimal design (see Discussion). As an illustration,
consider trying to select a diverse geographical sampling of the
country. A map (i.e., [...]) would be more [...] than just a table of
intercity distances. Methods based only on distances typically can
select points spread out in space, i.e., nonredundant sets, [but] they
[cannot] identify collinearities, [...] represented, determine whether
all dimensions have [...], [or] whether a point lies near [the] center
of the space or near the edges. In short, pure distance-[based] methods
[assess] redundancy only, but coordinate-based methods assess coverage
as well. The extra effort to apply [...] is therefore justified, since a
well-tailored library greatly benefits from actual coordinates rather
than just a distance-based selection.
Experimental Design. [...] to best represent a much larger potential
candidate set falls under the discipline of "experimental design".
Experimental design has been applied to many pharmaceutical problems,
including the design of structure-activity relationship (SAR) compound
sets, the optimization of synthetic processes, the design of calibration
standards in analytical chemistry, [and ...] screening subsets from
corporate chemical archives. D-optimal design has recently been applied
to selecting small numbers of substituents from larger sets of suitable
[...] to use in synthetic combinatorial libraries for drug discovery.

In cases where any combination of property values can be achieved, such
as time, temperature, and reagent concentrations in a synthesis,
precomputed classical designs are [...]. [...] often only discrete,
poorly distributed combinations of properties, algorithmic designs such
as D-optimal or A-optimal design are preferred. In algorithmic designs,
a "candidate set" is identified, and a much smaller "design set" is
chosen from the candidates [...].

An initial set of substituents can optionally be preselected for
inclusion in the design, and that set can be "augmented" to the desired
size, choosing the remaining members [...] as [...] can then be used as
the initial set in a subsequent design augmentation. This capacity to
build up a complex design from successive augmentations is the basis for
our approach to tailoring library designs.

Methods


Property Calculations. The property space of the substituents was
calculated as previously described. The space included the calculated
octanol/water partition coefficient, [...] descriptors derived from
principal components analysis of 81 topological indices, "chemical
functionality" [descriptors derived from] MDS of Tanimoto similarities
[based] on Daylight 2-D substructure fingerprints, and "receptor
recognition" descriptors derived from MDS of similarities from "atom
layer tables", which give the distribution along [...] of [...] charged,
[...] groups. The numbers of chemical functionality and receptor
recognition descriptors were determined as the fewest MDS dimensions
required to reproduce the dissimilarity matrix with a relative standard
[deviation of ...]. These properties were chosen to characterize
similarity and diversity with respect to lipophilicity, shape, chemical
functionality, and distribution of key receptor binding features. The
properties were calculated using the program MAKESPACE, which comprises
a collection of commercial programs, C and FORTRAN code, and Tcl
scripts, all coordinated through the UNIX "make" utility. The program
takes lists of reagents or substituents as input. It normalizes the
structures by removing counterions, standardizing resonance forms, etc.
It then determines the best commercial source for each reagent, computes
the similarities and properties, performs MDS to create [...], and loads
the structures and properties [into] a THOR database for searching with
MERLIN.
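
As a small illustration of the similarity measure named above, the Tanimoto coefficient between two fingerprints can be computed from the positions of their "on" bits. This is only a sketch: the fingerprints below are made-up bit sets, not Daylight fingerprints, and the function names are hypothetical.

```python
def tanimoto(bits_a: set[int], bits_b: set[int]) -> float:
    """Tanimoto similarity of two fingerprints given as sets of 'on' bits."""
    if not bits_a and not bits_b:
        return 1.0  # convention assumed here for two empty fingerprints
    shared = len(bits_a & bits_b)
    return shared / (len(bits_a) + len(bits_b) - shared)


def dissimilarity_matrix(fingerprints: list[set[int]]) -> list[list[float]]:
    """Pairwise 1 - Tanimoto values, the kind of matrix passed on to MDS."""
    return [[1.0 - tanimoto(a, b) for b in fingerprints] for a in fingerprints]


# Three toy fingerprints (hypothetical bit positions).
toy_fingerprints = [{1, 4, 9}, {1, 4, 7}, {2, 3}]
toy_dissimilarities = dissimilarity_matrix(toy_fingerprints)
```

The resulting square matrix is the input from which MDS derives the latent "chemical functionality" coordinates.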

Substructure Searching. Tailored library design requires the ability to
rapidly and interactively sort or subset a collection of reagents or
candidate substituents by structural criteria or physical properties.
This is used both to identify feasible candidate reagents and to
categorize substituents into [...] MERLIN [...] performed [...] sets
were stored as THOR data tree files (TDT). Designs were displayed by
TKPRADO, an interactive, window-driven Tcl/Tk script that displays
structures using the Daylight PRADO utility.

Optimal Designs. The thousands of optimal designs computed to choose the
substituents are set up [by the] interactive, window-driven Tcl/Tk
program TAILOR, which executes a slightly modified version of the public
domain [...] creates [and] manipulates files and directories to [...]
input and output data. It reads and [...] with the Daylight software for
[...]. In addition, TAILOR performs automatic calibration designs to aid
interpretation of the results and also automatically [...] replacements
for each compound in the design. [...] MAKESPACE and TAILOR are not
commercially available programs, but most of the individual modules that
comprise them are. MAKESPACE and TAILOR primarily automate their
application.

To circumvent the D-optimal algorithm's tendency to [...] (see
Discussion), TAILOR can saturate the designs, i.e., it generates models
where the number of terms is equal to the desired number of [...].
[The] models are built up systematically, [...] including an intercept
and linear terms, in order, starting with the largest principal
component (PC). Squared terms can optionally be added, followed by
enough cross terms, starting with PC1 x PC2, PC1 x PC3, PC2 x PC3, etc.,
to saturate the model. For an N-dimensional property space, this
protocol covers up to (N^2 + 3N)/2 points, i.e., up to 170 substituents
for the 17 dimensions in the current example. Higher terms can also be
added if needed.
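
The term ordering described above can be sketched as a short enumeration. This is an illustration under assumptions, not TAILOR's code: the function name `saturated_terms` is hypothetical, and the intercept is left out of the count because N linear + N squared + N(N-1)/2 cross terms is what reproduces the quoted (N^2 + 3N)/2 = 170 for N = 17.

```python
def saturated_terms(n_pcs: int, n_terms: int, include_squares: bool = True) -> list[str]:
    """Return the first n_terms model terms: linear, then squared, then cross."""
    pcs = range(1, n_pcs + 1)
    terms = [f"PC{i}" for i in pcs]
    if include_squares:
        terms += [f"PC{i}^2" for i in pcs]
    # Cross terms in the order PC1*PC2, PC1*PC3, PC2*PC3, PC1*PC4, ...
    terms += [f"PC{i}*PC{j}" for j in range(2, n_pcs + 1) for i in range(1, j)]
    if n_terms > len(terms):
        raise ValueError("higher-order terms would be needed to saturate the model")
    return terms[:n_terms]


# A 50-member design over 17 PCs: 17 linear, 17 squared, and 16 cross terms.
example_terms = saturated_terms(n_pcs=17, n_terms=50)
# The largest model available without higher-order terms.
full_terms = saturated_terms(n_pcs=17, n_terms=170)
```

Listing the terms this way makes it easy to see which cross terms are the first to be dropped when a smaller design is requested.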
Results

Candidate Reagent Identification. Library design begins with identifying
the candidate pool of potential substituents. This example study began
with [...] 9077 phenols in the Available Chemicals Directory (ACD). Most
of these were not suitable, being too large, too expensive, too hard to
obtain, insufficiently reactive, contain[ing ...] moieties, or [...]
chemistry to be practical. The unsuitable compounds were removed by a
series of substructure and property [searches] using MERLIN. Table 1
shows the searches [that] were used to remove undesirable amines. The
specifics are particular to each reaction scheme, and filters are
applied [...] continuously reevaluating the reagent list. [For example,
...] step 6, but selected for removal later in step 19. Some [searches]
included a hand inspection of the [...], [in] which only some of the
matching compounds were eliminated. An extensive catalog of previously
compiled queries is available to streamline the filtering process. Even
so, an experienced library designer, working closely with the chemist
who developed the synthetic scheme, should expect to spend no less than
2-4 h on this step. This is time well spent. Culling out only the
suitable reagents at the outset [...]

[...] ACD under several tautomeric or resonance forms. Likewise, an
unsymmetrical diamine could generate two distinct substituents.
MAKESPACE uses an extensive set of 44 rules to "standardize" the
structures by eliminating counter[ions ...]. This set of rules has been
developed and tested over many years and correctly handles virtually
every small molecule [...] kept as a link back into the ACD for later
ordering the reagents. Another set of rules identifies the best vendor
[...]
Property Calculation. The amine and phenol sets were combined for a
total of 1693 compounds [for the] property calculations. Thus, all
substituents for both positions were embedded into a single property
space, allowing diversity to be maximized between sites as well as
within sites. Substituent properties were calculated by MAKESPACE as
described above, and a THOR database was generated containing the
standardized substituent [structures, ...] prices, preferred names, and
some computed properties including log P, MW, number of rotatable bonds,
and distance from the centroid of property space. The property space
required five shape descriptors, seven chemical functionality
descriptors, and six receptor interaction descriptors, as well as log P,
for a total of 19 dimensions. Principal components (PCs) analysis showed
that 17 PCs explained 99.5% of the variance, so the last two PCs were
disregarded. In this example, the final substituent database contained
943 amines and 750 phenols.
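
The step of keeping only the PCs needed to reach a variance threshold can be sketched with NumPy as below. This is a minimal illustration: the function name `pcs_for_variance` is hypothetical, and centering without any further scaling of the property columns is an assumption, since the text does not say how the 19 descriptors were weighted.

```python
import numpy as np


def pcs_for_variance(properties: np.ndarray, threshold: float) -> tuple[np.ndarray, int]:
    """Project a property table onto the fewest PCs reaching the threshold."""
    centered = properties - properties.mean(axis=0)
    # Squared singular values of the centered table are proportional
    # to the variance carried by each principal component.
    _, singular, vt = np.linalg.svd(centered, full_matrices=False)
    explained = np.cumsum(singular ** 2) / np.sum(singular ** 2)
    n_keep = int(np.searchsorted(explained, threshold)) + 1
    n_keep = min(n_keep, len(singular))
    scores = centered @ vt[:n_keep].T
    return scores, n_keep


# Toy table: two informative columns plus one nearly redundant column.
rng = np.random.default_rng(1)
base = rng.normal(size=(200, 2))
table = np.column_stack([base[:, 0] * 10.0, base[:, 1] * 5.0, base[:, 0] * 1e-6])
scores, n_kept = pcs_for_variance(table, threshold=0.995)
```

In the example from the text, the analogous call on the 19-dimensional table would retain 17 PCs at a 99.5% threshold.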

Creation of Bins. The next task was the creation of bins. To control the
distribution of properties present in the library and to evaluate the
tradeoffs between high diversity and bias toward drug-like properties,
one first assigns the candidate substituents to subsets representing
desired (or undesired) properties. These subsets of substituents to be
emphasized or deemphasized in the design are called "bins". Each
substituent has many properties, and the bin categories overlap, so most
are assigned to several bins. Table 2 shows the bins used for the
amine-derived position. The library was to emphasize rigid, polar,
validated, drug-like, and pharmacophoric substituents. The exact number
of compounds to be used from each bin was to be [...] the design
process. The "validated" bin contained 76 amines for which the yield and
purity of the reaction had been confirmed. The "seed" bin contained
4-methoxybenzylamine, which was the validated point nearest the centroid
of the property space, [as well as] 4-hydroxyphenethylamine and [...],
whose corresponding side chains were previously found in potent ligands
for the α1-adrenergic and μ-opiate receptors. The "Extreme" category
held 27 of the amines farthest from the centroid of property space.
These compounds were mostly complex hydrocarbons or sugar
analogues. Typ.cal.y, many intuitively undesirable com 
Table 1. Searches used to remove undesirable amines

step  description
1     keep aliphatic primary
2     keep aliphatic secondary & take union
3     remove MW > 250
4     remove obscure vendors
5     remove cost > $500/g
6     remove weird elements
7     long unbranched chains
8     aromatic tricyclic bridgehead
9     aromatic many cycles (kept a few)
10    bridgehead (kept if substituted)
11    fluorines
12    linear Fs
13    acids and enols
14    long skinny (kept some)
15    thiophenes, furans
16    big rings
17    epoxides
18    alpha eliminators
19    Br or I
20    enol-ethers
21    disulfides
22    benzofurans
23    benzoquinones
24    N-O
25    aldehydes
26    alkyl halides
27    isocyanates
28    sulfides, disulfides
29    mixed, unsymmetrical diamines
30    secondary diamines (keep symmetrical)
31    primary diamines (keep symmetrical)
32    cyclopropylamines
33    elimination problems
34    elimination problems
35    anilines
36    hindered primary amines
37    hindered secondary amines
38    acidic OH
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[CX4][NH2] 
[CX4j[NH][CX4] 

N. A.' 
N. A. 

[!C!c!0!o!N!n!Sfs!F!Cl!BrfItNair-iH^im 

[aR2]a[aR2] . 
[R3] 

F- F. F. F. F F 
FC(F)C(F)F 
*=*[OH] 

fr8,r9,rl0,rll, r i2,rj3,rI4rl51 

fOAN,n][CX4][0,o,N,n] 

C=CO 
SS 

olccc2clcccc2 
0=ClCCC(=0)c2clcccc2 

0-[CH] 
[CX4J[CI,Br,!J 
N=C=0 
[SD2] 

[CX4][NH][CX4].[NH2]rCX41 
CX4j[NH][CX4].[CX4][NHircX41 
tNH2][CX4]. f NH2][CX4] J[ J 
[NH]ICCI J 

$(tNH][CX4,c]),S ([ NH2])lc J 
*[CX4](*)([NH]j[CX4](V 

[$(C#N),S(N=0),S(S=0) $( C =0) Jr r -i V - r m_i 



pounds are concentrated in the extremes of property space. [...]
self-explanatory combinations of the sets above.

[...] shown in Table [...] designs as described below [...] D-optimal
(or S-optimal) design of 50 from the full [...]
Table 2. Bin Profiles Used

bin        no.   description                               range     tries
center       1   validated point nearest centroid          0         0
pharma       5   from previous assay hits                  0         0
rigid      346   two or fewer rotatable bonds              0         0
FF          30   more than three fluorines                 0         0
extreme     27   more than three std dev from centroid     0         0
seed         3   center + two pharmas                      3         1
LoRgPlV     13   low MW, rigid, polar, validated           -10-14    3
boc          6   available with boc-protected amine        1         1
tbu          9   available with t-butyl-protected acid     2         2
LoPlrV      21   low MW, polar, validated                  -15-19    3
LoRgPlr     77   low MW, rigid, polar                      5-6       2
LoPlr      126   low MW, polar                             4-5       2
DrgPlr      88   drug-like and polar                       -30-34    3
OKamine     46   diamines that need no protection          1         2
DrugV       11   drug-like and validated                   2-4       2
valid       76   reaction has been validated               2         2
Drugish    151   contain pieces from 100 top drugs         1         1
LowMW      229   molecular weight < 130                    2         2
polar      378   H-bond acceptor and log P < 1.5           -46       3
FF_xtrm     57   union of FF and extreme bins              2         2
good       699   chemistry is expected to work well        2         2
HC         177   hydrocarbons only                         0         0
tyr_lik     50   closest analogues to tyramine             0         0
all        762   union of all bins above                   0         0

[...] to that number.

Table 3. [...]

fragment                            count   fragment        count
5-aromatic, [1,3] heteroatoms         21    pyridine          6
beta lactam                           19    furan             5
5-aromatic, 1 heteroatom              16    quinazoline       5
6-nonaromatic, [1,4] heteroatoms      16    imidazole         4
pyrrolidine                           16    indole            4
5-nonaromatic, [1,3] heteroatoms      14    naphthalene       4
6-aromatic, 1 heteroatom              13    purine            4
6-aromatic, [1,3] heteroatoms         13    benzimidazole     3
piperazine                            12    pyrimidine        3
guanidine                             11    quinoline         3
nitro                                 10    1,3-dioxolane     2
piperidine                             7    morpholine        2
thiazole                               7



those random subsets. The "random" method refers simply to choosing 50
substituents from a set at random [...] with no optimal design step. All
selections and subsequent optimizations were repeated five [times, and
the] average score [and] standard deviation [are re]ported.

Since the "all" set is the union of all bins, it establishes the maximal
possible D-score as 136. The set of 50 closest tyramine analogues
establishes a very nondiverse benchmark [...] considered a practical
[...]. [...] some correspondence between the size of a set and the
resulting score, the correlation is low. Except for the very restrictive
set of low molecular weight-rigid-polar compounds, the worst of the
optimized designs from restricted bins is still much better than
selecting 50 substituents at random from the "all" set. This illustrates
how poor random designs are. The penalty for eliminating 50% of all
synthetically suitable compounds is not great, and even eliminating 80%
of the compounds at random [...] alone. The low molecular weight
restriction limits diversity



Table 4. [...]

method   bin       size   D-score   D-sd
100%     all        762     136
50%      all        381     124      2.6
20%      all        152     105      4.5
100%     rigid      346     105
100%     polar      378     101
10%      all         76      87      2.9
100%     RgPlr      146      81
100%     valid       76      80
100%     Drugish    151      76
100%     LowMW      229      66
100%     DrgPlr      88      60
100%     HC         177      61
100%     LoRg       154      55
random   all         50      50      3.6
random   valid       50      47      7.5
random   polar       50      39      5.8
random   RgPlr       50      38      5.6
random   rigid       50      37      8.2
random   DrgPlr      50      21      4.6
100%     LoRgPlr     77      16
random   LowMW       50       5      7.8
random   LoRg        50       3     10.0
random   Drugish     50      -1      7.0
random   LoRgPlr     50      -9      6.9
random   HC          50     -16      5.4
100%     tyr_lik     50    -148



3.23 

3.01 

2.65 

2.63 

2.58 

2.13 

2.32 

2.11 

1.94 

2.01 

1.78 

2.00 

1.92 

L42 



0.02 
0.05 



0.09 



0.16 



0.5 



[...] bin descriptions are in Table 2. [...] the number of candidates in
the subset. [...] showing the standard deviations for the five attempts.

drastically, almost as much as a calibration bin that only includes the
pure hydrocarbons.

D-Optimal "Tailored" Designs. Anticipated ranges of bin membership for
the tailored design were selected as shown
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from Table 2 
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i ioiJ 8 I r-^—A-ULJiJL.^^ (30-32) (I 

9 inn * « -> 4 Q i ^ "~ — 



12 
12 
/</ 

15 
13 
17 
17 
15 



low ngid 
8-24) (12-17) 



31 

32 
30 
30 
30 
30 
30 
31 



18 
20 
21 
22 
18 
23 
24 
22 



13 
15 
16 
17 
14 
14 
14 
14 



property. E.g., the drug column is the sun, of the "Drug-polar" and S^L" X'ns ™ ^ ^ 



in Table 2. Some categories, like small, rigid, polar groups, were
emphasized. Others, like large, hydrophobic, and extreme groups, were
deemphasized. The D-optimal designs were all made for a quadratic model
saturated with cross terms, using the largest 17 of 19 PCs only.
Separate designs were made for each of the 1320 profiles consistent with
these ranges and a total design size of 50 amines, i.e., each design
[was] generated from a profile [by starting] with the "seed" set and
then augmenting by D-optimal design with the specified number of
substituents from bin "LoRgPlV". This combined set was further augmented
with the "boc" set and [...] sampled. The order was chosen to sample the
most restrictive sets early, so the optimization algorithm has broad
latitude to fill remaining holes in property space from the most general
sets toward the end. Since the sets can overlap, the specified ranges
actually determine a minimum bias, i.e., the selections from the final
"good" bin might also be LowMW and/or polar and/or rigid as well.
Likewise, sets to be deemphasized [...] not overlap a general set (such
as good) unless the set is never actually sampled (such as ALL, which
has a range of 0). Each D-optimal step was [...] times, as shown in the
[...]. With 1320 profiles times 34 D-optimal steps per profile, this
required performing 44 880 separate D-optimal designs. The entire
calculation took about 1 day elapsed time on an SGI Indigo 2 with a 150
MHz R4400 CPU running IRIX 5.2.
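
The build-up of one tailored design from a profile (seed first, then one augmentation per bin, most restrictive bins first) can be sketched as follows. This is a simplified illustration, not the program used in the paper: it uses greedy forward selection rather than an exchange algorithm, adds a small ridge term so the information matrix stays invertible while the design is smaller than the model, and all names (`augment_design`, `tailored_design`, `log_det_information`) are hypothetical. The score returned is a plain log-determinant and is not on the scale of the D-scores reported in the tables.

```python
import numpy as np


def augment_design(model: np.ndarray, design: list[int], bin_members: list[int],
                   n_add: int, ridge: float = 1e-6) -> list[int]:
    """Greedily add n_add rows from one bin, each maximizing det(X'X)."""
    design = list(design)
    n_terms = model.shape[1]
    for _ in range(n_add):
        rows = model[design]
        info = rows.T @ rows + ridge * np.eye(n_terms)
        inv = np.linalg.inv(info)
        best, best_gain = None, -np.inf
        for idx in bin_members:
            if idx in design:
                continue  # bins overlap, so skip rows already selected
            x = model[idx]
            # Matrix determinant lemma: det(M + xx') = det(M) * (1 + x' M^-1 x)
            gain = float(x @ inv @ x)
            if gain > best_gain:
                best, best_gain = idx, gain
        if best is None:
            raise ValueError("bin has no unused candidates left")
        design.append(best)
    return design


def tailored_design(model: np.ndarray, seed: list[int],
                    profile: list[tuple[list[int], int]]) -> list[int]:
    """Apply successive augmentations; profile lists (bin members, count) in order."""
    design = list(seed)
    for members, count in profile:
        design = augment_design(model, design, members, count)
    return design


def log_det_information(model: np.ndarray, design: list[int], ridge: float = 1e-6) -> float:
    """Log-determinant of the (ridge-stabilized) information matrix of a design."""
    rows = model[design]
    _, logdet = np.linalg.slogdet(rows.T @ rows + ridge * np.eye(model.shape[1]))
    return float(logdet)


# Toy model matrix: four candidates described by two model terms.
toy_model = np.array([[1.0, 0.0], [1.0, 0.1], [0.0, 1.0], [0.5, 0.5]])
toy_design = tailored_design(toy_model, seed=[0], profile=[([1], 1), ([2, 3], 1)])
```

Repeating `tailored_design` over every profile allowed by the bin ranges and ranking the results by score mirrors the enumeration of profiles described above.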

Some notable designs are presented in Table 5, in order of decreasing
diversity. The most diverse of the 1320 tailored designs had a score of
102. Comparison to the benchmarks [...] reveals that this score is
comparable to eliminating 80% of the candidates at random or to the best
designs from [...] subsets. This significant, but acceptable, reduction
in diversity is the penalty paid to achieve a profile of properties
suitable for bioavailable drugs. The worst tailored design had a score
of 83, slightly less than an optimal design made after eliminating 90%
of all feasible candidates at random, but still much better than a
simple random selection of 50 compounds from all feasible candidates,
with an average score of 50. It is usually possible to generate
carefully tailored designs that maintain much of the possible diversity.

Table 5 also lists the bin membership from those bins for which a range
was specified. Some bins combine several overlapping categories, so
Table 5 also shows the total number of drug-like, valid, polar, low MW,
and [rigid ...] each design (e.g., the "drug" column is the sum of the
"drug-polar" and "drug-valid" columns). Surprisingly, the most diverse
design, no. 1, has an unusually high drug-like bias, having 12 of a
possible 14 members [from the] drug-like bins. However, its members tend
to be high [...], and very few were valid[ated]. [...] drug-like
substituents, but has [...] low MW and rigid bias. Design 67, with a
score of 96, is the most diverse of the designs with the maximum 17
rigid [...] with the full 17 validated substituents. The most diverse
among these is no. 178, with a score of 94.7. This design is fairly good
in every category and would be a good candidate [...] disappointingly
low. Design 470, the best design with the maximum 24 low MW
substituents, scores only 92.4.

Since this example library was intended primarily for broad screening, a
high diversity score was desired. Design 35, with a score of 97, was
chosen as a good compromise between diversity, property bias, and
synthetic ease. Even a very thorough initial review of the candidate
reagents rarely [...] undesirable substituents. In this case,
[inspec]tion of the 50 chosen substituents revealed that one was a
dipeptide, which might cause formulation, delivery, and metabolism
problems. A second, more general complaint was that the library
contained 10 substituents with amide hydrogens, which are believed to
carry similar liabilities.

A new design rectified these deficiencies. The six dipeptide reagents
were eliminated from all bins by a substructure [search. ...]
substituents with amide protons was reduced by allowing them in the most
elite bins but removing them from the more general LoPlr, LowMW, polar,
FF_xtrm, good, Drugish, and valid bins. The new bin sizes and design
profile are shown in Table 6. This profile, which was focused around the
previously favored design 35, contained only 18 possible designs, so the
calculation took only 20 min. The results are shown in Table 7. The best
design had a D-score of 96, showing only a small diversity penalty for
this improvement, and only 6 of 50 substituents now had amide hydrogens.
For this library, the design was deemed acceptable. More typically,
however, further analysis would lead to additional cycles of evaluation
and refinement, each taking about 10-20 min, before achieving a final
acceptable design.

Having completed the amine design, a 50-member phenol 
design was created by using the amine design as a 50- 
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seed 

LoRgPIV 
boc 
tbu 

LoPlrV 
LoRgPlr 
LoPIr 
DrgPlr 
OJCamine 
DrugV 
valid 
Drugish 
LowMW 
polar 
FF^xtrm 
good 



center + two pharmas 

low MW, rigid, polar, validated. 

available with boc-protected amine 

available with /-butyl-protected acid 

low MW, polar, validated 

low MW, rigid, polar 

low MW, polar 
drug like and polar 
diamines that need no protection 
drug like and validated 
reaction has been validated 
contain pieces from 100 top drugs 
molecular weight < 130 
H-bond acceptor and log P < J .5 
union of FF and extreme bins 
chemistry is expected to work well 



21 
77 
112 
88 
46 
I I 
76 
139 
215 
331 
54 
649 



range 

3 

9- 
I 

2 

0-1 

5 
5 

6-8 



4 
2 
I 

2 

-46 
2 
1 



2 
2 
3 
I 

2 
2 
1 

2 
3 
2 
2 



SP.nrp r n D„Di\; i . , ~ . L_ 



rank 
1 

2 
3 



10 



12 
13 
14 
15 



score 
(86-96) 

96 
95 
93 
93 
92 
92 
92 
91 
91 
91 
90 
89 



LoRgPIV 
(9-11) 

10 
9 



10 



LoPlrV 
(0J) 
0 
0 
0 
0 



10 
10 



10 



10 



DrgPlr 

(6^8) 

7 
6 
7 
6 
8 
6 
8 
8 
7 
7 
8 
7 
6 
8 
7 
6 



polar 
(1^4) 

3 
5 



87 
87 
87 
86 

member "seed" bin and augmenting it from phenol candidate bins to yield
a tailored design of 100 members. The tailoring of the phenol design was
completely analogous to the amine design just described. Including the
amine-derived substituents in the phenol diversity design ensures high
diversity between positions in the final library as well as within
positions.

At this point the reagents could be ordered and validated. Inevitably,
some reagents will be out of stock, and others will fail in validation.
To save time later, TAILOR automatically generates the best D-optimal
alternates for each member of the design, so that they can be ordered in
advance and validated in parallel. To keep the property profile
unchanged, each alternate is chosen from the same bin from which the
original substituent had been selected. Figure 2 shows some examples. In
the first two cases, the best replacement was an obvious analogue of the
original substituent, but in the other two examples, the D-optimal
algorithm completed the design by selecting a radically different
structure, presumably to fill another large hole in diversity space. The
latter examples are more useful, because if the original selection fails
in validation, that does not suggest that the replacement will likewise
fail. Since the same reagent was often the best



drug 
(10-12) 

1 I 
10 
II 

10 
12 
10 
12 
12 
1 I 
1 1 
12 
1 1 
10 
12 
II 
10 
12 
10 



valid 
(I3-I_6) 

14 
13 
13 
15 
13 
14 
15 
14 



15 
14 
16 
14 
15 
15 
15 
16 
16 



polar 

00) 

30 

30 

30 

30 
30 
30 
30 
30 
30 
30 
30 
30 
30 
30 
30 
30 
30 
30 



low 
(19-21) 

20 
19 
19 
21 
19 
20 
21 
20 
20 
21 
20 
22 
20 
21 
21 
21 
22 
22 



rigid 
(14-16) 

15 
14 
14 
16 
14 
14 
15 
14 
14 
15 
15 
16 
15 
16 
16 
15 
16 
16 



alternate for more than one substituent and it is helpful to find a
dissimilar replacement, second, third, and higher alternates are often
generated. Order sheets were automatically generated for all reagents,
sorted by preferred vendor, including price, structure, name, and
catalog number.

Discussion 

Whole Molecule vs Fragment-Based Properties. The design of the tailored
library was based on the calculated properties of the fragments from the
variable positions in the library, rather than on the assembled final
molecules. This substituent approach takes advantage of the inherent
structural similarities between all of the members of a combinatorial
library. It assumes that diverse substituents generate diverse
libraries. The obvious risk of working with substituents is that a
design based on fragment properties does not explicitly account for
interactions between the fragments in the assembled molecules, so
assembling diverse substituents might not result in diverse molecules.
However, since a combinatorial library includes every combination of
substituents, it does form a full factorial design to characterize any
interactions implicitly. The whole molecule alternative is limited by
computer resources. If any of 1000 substituents
Figure 2. Sample alternates chosen by D-optimal design. In the first two
cases the best replacement was an obvious analogue of the original
substituent. In the other two examples the D-optimal algorithm completed
the design by selecting a radically different structure, presumably to
fill another large hole in diversity space.

could be put at each of 4 positions on a scaffold, the enumerated
virtual library would contain 10^12 assembled compounds. The method just
described can be performed routinely on 4000 substituents in a few days,
which has been sufficient for every library design problem we have
encountered. Any approach which could be applied to even 10[...]
assembled molecules could only use the most primitive of property spaces
and selection methods. We believe the assumption that diverse
substituents yield diverse libraries is [...] than [the] assumption that
very simplistic property calculations and selection procedures would
yield effective libraries.

Furthermore, property space formed from whole molecule calculations
may not describe the diversity of the final combinatorial library as
well as the corresponding substituent calculations. For example, the
2-D Daylight fingerprint descriptors only include paths of up to seven
bonds. However, diversity libraries based on side chain descriptors
for three positions incorporate information on up to 21 bonds per
compound in the final library. In addition, including the scaffold in
the calculation disguises information about the substituents. Any
fingerprint bits set by the scaffold are set in every single molecule,
so the presence or absence of those substructures cannot be
distinguished in the side chains. Patterson et al. described this
phenomenon when they found that 2-D fingerprint similarity for
substituents correlated with biological activity consistently better
than the corresponding whole molecule descriptors across 20
quantitative SAR data sets from the literature. We have tried
designing libraries where each substituent was attached to the
scaffold before calculation of the property space, and we likewise
found that the computed similarities between members increased and
many important structural distinctions were lost. Again, these
arguments apply specifically to the design of combinatorial libraries,
such as those made by split and mix resin synthesis or parallel-array
synthesis. Other diversity design problems, such as selecting subsets
of a corporate archive or purchasing compounds from collections of
arbitrary structures, require methods that can deal with whole
molecule descriptors.
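The effect of scaffold bits on computed similarity can be illustrated
with a toy calculation. The sketch below is not the fingerprint
software used in this work; it treats a fingerprint as a plain set of
"on" bit positions, and the bit values for the two substituents and
the scaffold are hypothetical. It also ignores the extra bits that
paths spanning both scaffold and substituent would set in a real
fingerprint.

```python
def tanimoto(a: set, b: set) -> float:
    """Tanimoto similarity between two sets of 'on' bit positions."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

# Hypothetical bit sets for two substituents (illustrative values only).
sub_a = {1, 2, 3, 4}
sub_b = {3, 4, 5, 6}

# Hypothetical bits set by the shared scaffold; present in every molecule.
scaffold = set(range(100, 120))

sim_substituents = tanimoto(sub_a, sub_b)
sim_whole = tanimoto(sub_a | scaffold, sub_b | scaffold)

# The shared scaffold bits inflate the similarity of the whole molecules.
print(f"substituents only: {sim_substituents:.2f}")
print(f"with scaffold:     {sim_whole:.2f}")
```

Because the scaffold bits appear in both the intersection and the
union, they push every pairwise similarity toward 1 and compress the
differences between substituents.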

D-Optimal Design. D-optimal design works by choosing a subset of
substituents from a large candidate set, maximizing the determinant of
the "information matrix", |X'X|, for a design matrix X. The rows of X
are the substituents and the columns are the "model terms", i.e., the
property space dimensions, and/or higher order terms such as their
squares or cross terms. This minimizes the determinant of the inverse
and, thus, the prediction error for a regression model. Equivalently,
information theory shows that this same criterion maximizes the
expected entropy change, i.e., it selects the set of substituents that
together carry the most information for estimating the model. Roughly
speaking, a large determinant requires large diagonal elements and
small off-diagonals. This implies large variances, so the selected
points are well spread out, and small covariances, so that
collinearities are minimized and all of the dimensions are sampled.

In substituent-based, tailored, combinatorial library designs, the
number of substituents at each stage of design is often small, and the
number of dimensions in property space is comparably large. In fact,
the dimensionality must often be reduced by principal components
analysis, so that the number of dimensions (degrees of freedom) does
not exceed the number of substituents. Since covariance-based methods,
like D-optimal design, minimize collinearities as well as maximizing
spread, they should still produce "balanced" designs that optimally
sample all dimensions, even when there are few or no extra degrees of
freedom. Since simple distance-based methods ignore collinearity, they
are likely to select sets of points that do not sample the full
dimensionality of the property space.
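The determinant criterion can be sketched in a few lines. The code
below is a minimal illustration under stated assumptions, not the
algorithm used in this work: it uses a simple greedy forward selection
rather than an exchange algorithm, the function names and the small
ridge term are hypothetical, and the candidates are random points in
an ellipsoid with semi-axes 1, 0.9, and 0.8. Unlike the designs
discussed in the text, no center point is forced into the design.

```python
import numpy as np

def log_det_information(X):
    """Return log|X'X| for a design matrix X (rows: substituents, columns: model terms)."""
    sign, logdet = np.linalg.slogdet(X.T @ X)
    return float(logdet) if sign > 0 else float("-inf")

def greedy_d_optimal(candidates, n_select, ridge=1e-6):
    """Greedy forward selection (a simplification, not an exchange algorithm).

    At each step, add the candidate row that most increases
    log|X'X + ridge*I|; the small ridge keeps the early steps well defined.
    """
    n, p = candidates.shape
    info = ridge * np.eye(p)
    chosen = []
    for _ in range(n_select):
        best_i, best_val = -1, float("-inf")
        for i in range(n):
            if i in chosen:
                continue
            x = candidates[i]
            _, val = np.linalg.slogdet(info + np.outer(x, x))
            if val > best_val:
                best_i, best_val = i, val
        chosen.append(best_i)
        info = info + np.outer(candidates[best_i], candidates[best_i])
    return chosen

# Hypothetical candidates: random points inside an ellipsoid (semi-axes 1, 0.9, 0.8).
rng = np.random.default_rng(0)
pts = rng.uniform(-1.0, 1.0, size=(2000, 3))
pts = pts[((pts / np.array([1.0, 0.9, 0.8])) ** 2).sum(axis=1) <= 1.0]

# Linear model: an intercept column plus the three property dimensions.
X = np.column_stack([np.ones(len(pts)), pts])
design = greedy_d_optimal(X, n_select=5)
print(pts[design])
print("log|X'X| of design:", log_det_information(X[design]))
```

A production design program would typically refine such a starting
design by exchanging selected and unselected candidates, and would
allow required points to be fixed in advance.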

A simple 3-D problem was devised to graphically illustrate the
advantages of covariance-based methods, such as D-optimal design, over
a standard distance-based method at sampling the full dimensionality
of property space. Although property space is often presented as a
hypercube, examining our property spaces computed from many actual
substituent sets (including this study) showed that the distribution
of points was always roughly elliptical; so for this example a random
set of test points was generated inside an ellipsoid from the equation
$x^2/1^2 + y^2/0.9^2 + z^2/0.8^2 = 1$. (Tests using the first three
principal components of real data sets gave results that were
essentially the same as this test case.) The simplest distance-based
method that has been recommended for substituent selection is
"MaxMin", which selects points to maximize the smallest
nearest-neighbor distance in the design set. While this criterion may
work for pure diversity designs, it is inappropriate for tailored
designs, since one often purposely includes some pharmacophore
analogues that lie in close proximity in property space, and these
would dominate the MaxMin score. Other distance-based methods spread
points out in space by maximizing various averages of all of the
near-neighbor distances or of all pairwise distances. The "S-optimal"
method ("S" for "spread") maximizes the harmonic mean of distances
between each chosen point and its nearest neighbor in the design.
Since it includes all near-neighbor distances but the harmonic mean
gives extra weight to the shorter distances, it is well suited for
diversity design. Criteria that maximize all pairwise distances,
rather than just near-neighbor distances, tend to flatten the designs
into the largest few dimensions.

Figure 3. (a) Sample 3-D S-optimal designs for three, four, five, and
six points, including a center point. Random candidate points were
generated inside an ellipsoid from the equation
$x^2/1^2 + y^2/0.9^2 + z^2/0.8^2 = 1$. (b) Sample 3-D D-optimal
designs using a linear model for three, four, five, and six points,
including a center point. Random candidate points were generated
inside an ellipsoid from the same equation.
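The two distance-based scores described above can be written down
directly. The following is a minimal sketch with hypothetical function
names and made-up 2-D coordinates, not the scoring code used in this
work. It scores two designs that contain the same pair of close
analogues but differ in how well the remaining points are spread.

```python
import numpy as np

def nearest_neighbor_distances(points):
    """Distance from each point to its nearest neighbor within the design."""
    pts = np.asarray(points, dtype=float)
    diff = pts[:, None, :] - pts[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    np.fill_diagonal(dist, np.inf)  # a point is not its own neighbor
    return dist.min(axis=1)

def maxmin_score(points):
    """MaxMin criterion: the smallest nearest-neighbor distance in the design."""
    return float(nearest_neighbor_distances(points).min())

def s_score(points):
    """S criterion: harmonic mean of the nearest-neighbor distances."""
    nn = nearest_neighbor_distances(points)
    return float(len(nn) / np.sum(1.0 / nn))

# Both hypothetical designs contain the same close pair, (0, 0) and (0.05, 0).
spread_design = [(0, 0), (0.05, 0), (1, 0), (0, 1), (1, 1)]
clumped_design = [(0, 0), (0.05, 0), (0.5, 0), (0, 0.5), (0.5, 0.5)]

for name, design in [("spread", spread_design), ("clumped", clumped_design)]:
    print(name, "MaxMin:", maxmin_score(design), "S:", round(s_score(design), 4))
```

The close pair fixes the MaxMin score of both designs, so MaxMin cannot
tell them apart, while the harmonic mean still rewards the better
spread of the remaining points.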

Examples are shown in Figure 3a,b. In each example, inclusion of the
origin was forced as a required bin. Figure 3a shows the most
"diverse" sets of three, four, five, or six points, including the
fixed center, selected by S-optimal design. The three-point design is
collinear. Four- and five-point designs are coplanar, so the z
property is never varied. The six-point design is a slightly puckered
pentagon, so it finally gives some (small) variation to the z
property. The second set of plots shows the points selected by the
corresponding D-optimal designs using a linear model. The three-point
design is an obtuse triangle, and thus varies two dimensions. The
four-point design is a flattened tetrahedron and samples all three
dimensions. The five-point design is a beautifully balanced design
consisting of a large tetrahedron plus the origin at its center. Six
points form a triangular bipyramid plus center point. Seven points
(not shown) give an octahedron plus the center. Evidently, because
D-optimality sacrifices some spread in order to minimize
multicollinearities, it samples all of the dimensions of property
space even with very few extra degrees of freedom. Hence, for tailored
designs, where the number of points is frequently close to the number
of dimensions, D-optimality is a good criterion for "diversity". In
these cases, D-optimal designs generally have reasonably high
S-scores, but S-optimal designs often have poor D-scores.

Table 8. Results from Designs of 18 Points in the 17-Dimensional
Property Space

method                 size    D-score    D-σ     S-score    S-σ
100%                    762     37.0      0.38     3.94      0.04
D-opt. w/ center pt     762     36.3               3.53
50%                     381     35.0      0.85     3.68      0.03
20%                     152     30.2      0.94     3.41      0.10
S-opt. w/ center pt     762     27.9               3.92
10%                      76     26.7      1.11     3.19      0.11
5%                      146     21.4      1.40     2.77      0.12
random (2.4%)            76     13.83     1.34     1.79      0.35
As an analogous test for the 17-dimensional space of the current
study, D-optimal and S-optimal libraries of 18 points were generated,
insisting on the point nearest the centroid as the sole fixed
requirement in each design. There is no way to visualize how well 18
points fill 17 dimensions, so D-optimal and S-optimal calibration
designs were also run. The results are compiled in Table 8. To
visually compare the D-scores and S-scores in Figure 4a, they were
both scaled to score "random" at the bottom of the diversity
"yardstick" and the pure maximal diversity design at the top. The
calibration points for randomly reduced candidate sets are indicated
and have been connected with dotted lines to help visually align the
two scales. The D-optimal design forced to include the center point
had a D-score of 36.3, which calibration showed was only slightly
below the maximum possible value for an 18-point design of 37.0 (with
no center point requirement). The S-optimal design requiring the
center point had an S-score of 3.92, again very comparable to the
S-optimal maximum of 3.94. The D-optimal design with the fixed center
point had a respectable S-score of 3.5, roughly comparable to an
S-optimal design run on 33% of the candidates. The S-optimal design
with the fixed centroid, however, had a relatively poor D-score of
27.9, roughly equivalent to throwing away 88% of the data at random
and performing D-optimal design on the remaining 12%. Nevertheless, it
is still much better than random selection (keeping 2.4%), with a
D-score of only 14.

Principal components analysis was performed to test the dimensionality
of the designs. For the D-optimal design, 15 PCs were required to
cover 99% of the variance, so the 18 points could be said to sample
about 15 of 17 possible dimensions. For the S-optimal design, only 12
of the 17 PCs were required to cover 99% of the variance, so by this
measure it captured three fewer dimensions than the D-optimal design.
Finally, the average correlation coefficient between variables for the
D-optimal design was 0.29, but was higher in the S-optimal design,
showing greater collinearities in the S-optimal design. Hence,
D-optimal design sacrificed a small amount of spread (redundancy) but
did a better job of covering property space.

Figure 4b uses the calibration data from Table 4, scaled similarly, to
compare the S-optimal and D-optimal based tailored designs of 50
points. The S-score for the D-optimal based tailored design was 2.3,
very comparable to the score of 2.5 for the actual S-optimal based
tailored design from the same bin profile. These values are
comparable, respectively, to selecting 17% or 13% of the "all" set at
random before performing a pure S-optimal design. However, the D-score
for the S-optimal based tailored design was only 77.5, compared to 96
for the D-optimal based tailored library. As Table 4 shows, this is
equivalent to eliminating all but 9% of the candidates vs 15%,
respectively. A random design, with a score of 50, corresponds to
6.6%, so the D-score for the S-optimal based tailored design is
approaching random. Hence, the D-optimal based tailored design has a
decent S-score, but the S-optimal based tailored design has a poor
D-score, even in this tailored library with almost three times as many
points as the number of dimensions.

Figure 4. (a) 18-Point "yardsticks" of D-scores and S-scores. 100%
refers to the maximally diverse designs. Other percentages refer to
randomly removing all but that fraction of the candidates and
determining the maximally diverse 18-point designs. The lowest score
is for random selection of 18 points. The arrows compare the S-score
of a D-optimal design and the D-score of an S-optimal design. (b)
50-Point "yardsticks" of D-scores and S-scores. 100% refers to the
maximally diverse designs. Other percentages refer to randomly
removing all but that fraction of the candidates and determining the
maximally diverse 50-point designs. The upper arrows compare the
S-score of a D-optimal design and the D-score of an S-optimal design.
The lower arrows are similar comparisons for tailored designs.
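The two dimensionality checks used above, the number of principal
components needed to cover 99% of the variance and the average
correlation between variables, can be sketched as follows. This is an
illustration only: the function names are hypothetical, and taking the
mean of the absolute off-diagonal correlations is an assumption about
how the average was formed. The two toy designs are a set of collinear
points and the six vertices of an octahedron.

```python
import numpy as np

def n_components_for_variance(design, fraction=0.99):
    """Number of principal components needed to cover `fraction` of the variance."""
    X = np.asarray(design, dtype=float)
    Xc = X - X.mean(axis=0)
    singular_values = np.linalg.svd(Xc, compute_uv=False)
    variance = singular_values ** 2
    cumulative = np.cumsum(variance) / variance.sum()
    k = int(np.searchsorted(cumulative, fraction)) + 1
    return min(k, len(cumulative))

def mean_abs_correlation(design):
    """Mean absolute off-diagonal correlation between property columns (assumed definition)."""
    r = np.corrcoef(np.asarray(design, dtype=float), rowvar=False)
    upper = np.triu_indices_from(r, k=1)
    return float(np.abs(r[upper]).mean())

# Collinear design: every point lies on one line through 3-D space.
collinear = np.outer(np.arange(5.0), [1.0, 2.0, 3.0])

# Balanced design: the six vertices of an octahedron.
octahedron = np.array([[1, 0, 0], [-1, 0, 0], [0, 1, 0],
                       [0, -1, 0], [0, 0, 1], [0, 0, -1]], dtype=float)

for name, design in [("collinear", collinear), ("octahedron", octahedron)]:
    print(name, n_components_for_variance(design), mean_abs_correlation(design))
```

The collinear design needs a single component and has perfectly
correlated columns, while the octahedron needs all three components and
has uncorrelated columns.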

As the number of points exceeds the number of dimensions, D-optimal
design will eventually suggest resampling some points. In the
three-dimensional example above, this happened at nine points. This
indicates that more points were requested than are required to
estimate a linear model, so higher order model terms could be added to
the model. A useful rule of thumb is to add cross terms from the
higher principal components until the S-score indicates that
resampling has been prevented. An alternative approach is to use
"Bayesian optimal design", which automatically adds some weight to all
of the cross terms. In practice, adding enough cross terms to saturate
the model generally works well for tailored designs (see Methods
above).
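Adding cross terms amounts to appending product columns to the design
matrix. The sketch below is one possible reading of the rule of thumb,
with a hypothetical function name and parameter: it assumes the
property columns are principal components ordered by decreasing
variance and adds the pairwise products of the first few of them.

```python
import itertools
import numpy as np

def model_matrix(props, n_cross_dims=0):
    """Build model terms: intercept, linear terms, and optional cross terms.

    Cross terms are the pairwise products of the first `n_cross_dims`
    columns, assumed here to be the leading principal components.
    """
    props = np.asarray(props, dtype=float)
    n_cross_dims = min(n_cross_dims, props.shape[1])
    columns = [np.ones(len(props))]
    columns += [props[:, j] for j in range(props.shape[1])]
    for i, j in itertools.combinations(range(n_cross_dims), 2):
        columns.append(props[:, i] * props[:, j])
    return np.column_stack(columns)

# Hypothetical 3-D property values for nine candidate points.
rng = np.random.default_rng(1)
props = rng.normal(size=(9, 3))

linear = model_matrix(props)                    # 1 + 3 = 4 model terms
with_cross = model_matrix(props, n_cross_dims=3)  # 4 + 3 cross terms = 7
print(linear.shape, with_cross.shape)
```

With only four model terms, a nine-point design has five surplus
points; the three cross terms raise the number of estimable terms to
seven, leaving less incentive to resample existing points.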

It should be reiterated that this analysis was specifically aimed at
problems where part of the design is preselected and the number of
points is not much larger than the number of dimensions. For other
problems, such as selecting a subset from a corporate archive of
hundreds of thousands of compounds in a space of only five or six
dimensions, other methods would be preferred, such as sampling from
cells in close packed lattices. An additional limitation of D-optimal
design is that the scores can only be compared between libraries of
the same size. S-optimal scores can be compared between libraries of
different sizes, so they are useful for initial studies of the
appropriate number of substituents.

Evaluation of the Library. The goal of this library design example was
to provide high structural diversity while constraining pertinent
physicochemical properties to suitable ranges for small molecule
drugs. To examine this, Figure 5a-d presents histograms and quantile
box plots for three sets of substituents: the full set of 756 useful
amines, the final tailored design of 50 compounds with a diversity
score of 96 (see above), and the maximally diverse D-optimal design of
50 compounds with no bias at all, which had a diversity score of 136
(see Table 4). Distributions are presented for four properties:
molecular weight, calculated log P, number of rotatable bonds, and
distance from the center of property space. Tables 9-12 give the
corresponding quantiles, means, and numbers of observations. The boxes
in the box plots indicate the 25, 50, and 75 percentiles. The diamonds
depict the means and standard deviations.

Figure 5. (a) Distances from the center of property space. (Quantiles
are in Table 9.) (b) Molecular weights for three designs of 50 amines
each. Histograms and quantile box plots show that the tailored design
does not favor the extremes of molecular size as does a simple
D-optimal "diversity" design. (Quantiles are in Table 10.) (c) Values
of log Kow for three designs of 50 amines each. Histograms and
quantile box plots show that the tailored design does not emphasize
extremes of high and low lipophilicity as does the simple D-optimal
"diversity" design. (Quantiles are in Table 11.) (d) Number of
rotatable bonds for three designs of 50 amines each. Histograms and
quantile box plots show that the tailored design does not emphasize
flexible substituents as does the simple D-optimal "diversity" design.
(Quantiles are in Table 12.) Panel C in each plot is the simple
D-optimal design of 50 amines with no tailoring.

Table 9. Frequency Distributions of Distances from the Centroid of
Property Space^a

quantile, %      all    tailored    max div
100.0           5.38      5.30       5.38
 99.5           5.20
 97.5           4.73      5.27       5.36
 90.0           3.96      4.23       5.07
 75.0           3.57      3.97       4.75
 50.0           3.23      3.44       3.86
 25.0           2.80      3.15       3.47
 10.0           2.48      2.86       3.08
  2.5           2.22      2.42       2.72
  0.5           2.15
  0.0           2.09      2.36       2.64
mean            3.24      3.54       4.03
N                756        50         50

^a Column "tailored" is the tailored design from this study. Column
"max div" is the maximally diverse design of 50 substituents.



Table 10. Molecular Weight Frequency Distributions

quantile, %       all     tailored    max div
100.0           249.36     241.46     249.36
 99.5           246.71
 97.5           232.44     238.96     249.32
 90.0           206.29     216.87     244.22
 75.0           183.57     200.29     217.82
 50.0           153.18     137.18     172.77
 25.0           121.18     114.19     128.95
 10.0            97.15      76.31      74.24
  2.5            71.12      63.13      34.36
  0.5            45.08
  0.0            31.06      60.10      31.06
mean               152        150        168
N                  756         50         50


Table 11. CLOGP Frequency Distributions

quantile, %      all    tailored    max div
100.0           6.63      6.63       6.45
 99.5           5.36
 97.5           3.45      5.79       6.38
 90.0           2.54      2.52       3.33
 75.0           1.92      1.08       2.20
 50.0           1.04      0.10       0.73
 25.0          -0.06     -0.99      -1.06
 10.0          -0.92     -1.58      -3.31
  2.5          -2.12     -3.63      -4.74
  0.5          -4.38
  0.0          -4.76     -4.19      -4.76
mean            0.89      0.18       0.48
N                756        50         50



Additional tick marks are the other quantiles listed in Tables 9-12.
The distributions of the full candidate set represent the expected
distributions of the random sets, which had an average diversity score
of only 50 (see Table 4).

Concern has been raised that pure D-optimal designs (as well as other
pure diversity designs) sample only the "outer edges" of property
space. The radial distributions in Figure 5a and Table 9 show that,
for better or worse, the pure D-optimal diversity set does indeed
oversample the extremes of property space relative to the original
distribution. The tailored design shows only a modest outward shift
relative to the candidates. Apparently, the constraints of sampling
from property bins counteract the D-optimal algorithm's propensity to
sample mainly remote regions of property space.

Table 12. Frequency Distributions for Counts of Rotatable Bonds

quantile, %      all    tailored    max div
100.0             14       12         14
 99.5             12
 97.5              8       11         13.5
 90.0              5        7          9
 75.0              4        3          5
 50.0              2        2          3
 25.0              1        1          1
 10.0              1        1          0
  2.5              0        0          0
  0.5              0
  0.0              0        0          0
mean            2.74     2.76       3.44
N                756       50         50

The histograms and quantile boxes of the three properties in Figure
5b-d show that the pure diversity set distributions are relatively
broad and flat and include most of the highest and lowest values of
each property, as shown in Tables 10-12. Recall that CLOGP is actually
one of the 19 dimensions of property space. Molecular weight and
number of rotatable bonds were not specifically included in the
property space calculations, but they are indirectly included through
correlations with topological indices, so extremes of property space
might well imply extremes of these properties as well. Examining the
property histograms shows that the pure diversity design emphasized
large flexible groups with either extremely high or extremely low
lipophilicity. Orally available drugs tend to be small, rigid
compounds with intermediate lipophilicity, so pure diversity designs
bias libraries away from ideal drug properties. The tailored library's
property distributions have wide tails but are not as flat and extreme
as the pure diversity designs. This is understandable, since only a
few members from the extreme bins were permitted in this design. It is
more hydrophilic than the original distribution, including a few
extreme values, but concentrating most of the members in the desirable
moderately hydrophilic region. About 75% of the substituents in the
tailored set have three or fewer rotatable bonds versus four in the
original distribution and five in the pure diversity set, showing that
tailoring has limited the fraction of flexible substituents. The
median (50%) molecular weight in the tailored design is lower than the
original distribution and much lower than the pure diversity design,
but there is a curious bimodal distribution with peaks at about 130
and 200. The pure diversity design has an extremely top heavy
molecular weight distribution, with the most frequent value in the
histogram actually being the highest MW slice, which had a very low
original frequency. Since the low molecular weight bin was strongly
emphasized in the tailored profile, this suggests that structural
diversity requires complexity, and complexity requires mass. Recall
that the molecular weight cutoff for the low MW bin was 130. The
diversity algorithm emphasizes the heaviest members available in each
bin. The distribution is aliasing from the two discrete MW cutoffs:
the peak near 130 from sampling the low MW bins and the other at the
highest values from sampling the
Conclusion

... libraries. Pure diversity designs, however, were found to ... what
these names suggest; designing ... broad screening requires a
combination of property cal...

If this much tailoring is useful even in broad screening designs, how
much more ... screening ... generality and simplicity. As long ...
technique to simultaneously optimize additional ... synthetic
difficulty along with the ... approach is rigorous and highly
automated, the ... criteria still subjective ... diversity and other
design criteria. Balancing ... benefits from art, experience, and the
clarity of "hands on" ... from a complex ... practicing medicinal
chemists and ... While this work criticizes pure diversity designs, it
... one of many important factors, some of which are difficult
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